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Abstract 

In astrophysics a common goal is to infer the flux distribution of populations of scientifically 
interesting objects such as pulsars or supernovae. In practice, inference for the flux distribution 
is often conducted using the cumulative distribution of the number of sources detected at a 
given sensitivity. The resulting "log(A > S) — log(5)" relationship can be used to compare 
and evaluate theoretical models for source populations and their evolution. Under restrictive 
assumptions the relationship should be linear. In practice, however, when simple theoretical 
models fail, it is common for astrophysicists to use pre-specificd piecewise linear models. This 
paper proposes a methodology for estimating both the number and locations of "breakpoints" 
in astrophysical source populations that extends beyond existing work in this field. 

An important component of the proposed methodology is a new Interwoven EM Algorithm 
that computes parameter estimates. It is shown that in simple settings such estimates are 
asymptotically consistent despite the complex nature of the parameter space. Through simula- 
tion studies it is demonstrated that the proposed methodology is capable of accurately detecting 
structural breaks in a variety of parameter configurations. This paper concludes with an appli- 
cation of our methodology to the Chandra Deep Field North (CDFN) dataset. 

Keywords: Broken power law, CDFN X-ray survey, Interwoven EM algorithm, Likelihood com- 
putations, log N — log S, Pareto distribution 

1 Introduction 

The relationship between the number of sources and the threshold at which they can be detected 
is an important tool in astrophysics for describing and investigating the properties of various types 
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of source populations. Known as the log N — logS relationship, the idea is to use the number of 
sources A(> S) that can be detected at a given sensitivity level S, on the log-log scale, to describe 
the distribution of source fluxes. In simple settings and under restrictive assumptions a linear 
relationship between the log-flux and the log-survival function can be derived from first principles. 
Traditionally astrophysicists have therefore examined this relationship by characterizing the slope 
of the log of the empirical survival function as a function of the log- flux of the sources. 



Examples of log N — log S analyses include Guetta et al. ( 2005 ) , who use the relationship for 
Gamma Ray Bursts (GRBs) to constrain the structure of GRB jets. By comparing the log A — 
log S relationship for observed data to the predicted log N — log S relationship under different 
physical models for GRB jets, the authors are able to uncover limitations in the physical models. 
The log A — log S curves have also been used to constrain cosmological parameters using cluster 



number counts in different passbands; see, e.g., Mathiesen and Evrard (1998) and Kitayama et al. 



( |1998 ) . Other applications of log A — log S modeling include the study of active galactic nuclei 
(AGNs). For example, Mateos et al. (2008) use the log N — logS relationship over different X-ray 
bands to constrain the population characteristics of hard X-ray sources. 

In the probability domain under independent sampling, the linear log A — log S relationship 
corresponds to a Pareto distribution for the source fluxes, known to astrophysicists as a power- law 
model. Despite the unrealistic assumptions in the derivation, the linear log iV — log S relationship 



does have strong empirical support in a variety of contexts; e.g., Renter and Murray (2003). In 
addition to its simplicity the power-law model also retains a high degree of interpretability, with 
the power-law exponent often of direct scientific interest. As a result, of this simplicity and inter- 
pretability, the power-law model forms the basis of most log A — log S analyses despite its many 
practical limitations in the ability to fit more complex datasets. 

To address the limitations of this simple model astrophysicists have also experimented with 
a variety of broken power-law models. This is particularly important for larger populations or 



populations of sources spread over a wide energy range. Mateos et al. (2008) illustrate this by 
using both a two- and three-piece broken power-law model to capture the structure of the log A — 
log S distribution across a wide range of energies. The basic idea of broken power-law models is to 
relax the assumption that the log survival function is a linear function of the log flux, and to instead 
assume a piecewise linear function. This adds additional challenges in estimating the location of the 
breakpoint, and quantifying the need for the breakpoint model above the simpler single power-law 
model. While recognizing the need to have more flexible models for log A — log S analyses, most 
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of the work in this area does not provide a coherent means to selecting the location and number of 
breakpoints. 

In this paper we provide an automatic method for jointly inferring the number and location of 
breakpoints and the parameters of interest for the log N — log S problem. Our method allows astro- 
physicists to reliably infer both the number and the location of breakpoints in the log N — log S re- 
lationship in a statistically rigorous manner for the first time. This simultaneous fitting introduces 
new computational challenges, so our method utilizes a new extension of the EM algorithm, known 



as the Interwoven EM Algorithm (IEM) (jBainesJ |2010t |Baines et al\ [2012J). The IEM algorithm 
provides efficient and stable estimation of the model parameters across a wide range of parameter 
settings for a fixed number of breakpoints. To determine the number of breakpoints we then use 



an additional model selection procedure that employs the power posterior technique of Friel and 



Pettitt (2008) to accurately compute the log-likelihood of the candidate models. 



The remainder of the paper is organized as follows. In Section [2] we introduce the necessary 
background and statistical formulation of the log iV — log S model. Section [3] provides details of 
our estimation procedure for a fixed number of breakpoints, with Section [4] outlining our model 
selection procedure to determine the number of breakpoints required. The performance of our 
method in terms of both parameter estimation and identification of the number of breakpoints is 
detailed in Section [5j An application to data from the Chandra Deep-Field North X-ray survey is 
provided in Section [6} Large-sample theory is developed in Section [7] and concluding remarks are 
offered in Section [81 Lastly technical details are deferred to the appendix. 



2 Background and Problem Specification 

Let S = (Si, . . . , S n ) T denote a vector of the fluxes (in units of ergs s _1 cm~ 2 ) of each of a 
population of n astrophysical sources. For example, we may be interested in the flux distribution 
of a selection of n X-ray pulsars located in a specified region of sky at a specified distance. The 
basic building block of our method is the power-law model: 

n 

N(>S) = J2hs i >s}-<xS-P, S>r. (1) 

i=l 

This specifies that the unnormalized survival function iV(> S) is approximately a power of the flux 
S. The power-law exponent, (5, is the parameter of primary interest and provides domain specific 
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knowledge about the source populations. The lower threshold r can either be fixed according to 
the desired sensitivity level, or estimated from the data. Equivalently, taking the logarithm of both 
sides, assumes a linear relationship between log(iV(> S)) and log(S): 

log(iV(> S)) ~ log(a) - P log(S), S > r. (2) 



In a statistical context, the theoretical power-law assumption corresponds to assuming that the 
source fluxes follow a Pareto distribution: 

Si ~ Pareto(/3, r), i = l,...,n. 



In practice, the linear log N — log S, or Pareto, assumption is not sufficient to describe the log N — 
log S relationship for many real datasets. There are several ways to generalize 0, the most popular 
among astrophysicists being the broken power-law model as illustrated in Jordan et al.\ ( |2004[ ) and 
Cappelluti et al. (2007). The starting point of the broken power-law is to replace ([I]) with a 



log(JV(> S)) 



monotonically decreasing piecewise linear approximation. In the case of a two-piece model we 
assume: 

log(Qi)-/3ilog(5), n<S<T 2 

(3) 

^log(a 2 ) - log(S), S > t 2 , 

where (3\ and f3 2 are parameters of interest. Note that as a result of the continuity and normalization 
constraints on Ti,T2,ai,a 2 , (3\ and f3 2 there are a total of 4 free parameters in this expanded two- 
piece model. Applications of the broken power-law model in the astrophysics community typically 



use either fixed numbers and locations of the breakpoint (s) or selection via ad hoc procedures (Tru- 



dolyubov et al, 2002). The contribution of this paper is the proposal of an automatic procedure for 



selecting the number and estimating the locations of the breakpoints jointly with the parameters 
of interest. 



2.1 Hierarchical modeling of the log iV — \ogS relationship 

We now describe the connection between the broken power-law model introduced in ([3]) and the 
observed data. In practice the flux of each source, Si, is not observed directly. Instead, we observe 
a Poisson-distributed photon count whose intensity is a known function of the parameter Si- Let 
Y\,Y 2 ,...,Y n denote the source counts, then we assume the following hierarchical model. For 



4 



i = l,...,n, 



Yi\Si, . . . , S n m ^S p ' Poisson(y4jS'i + bi) and 



iid 



Si ~ Pareto B (/3,T), 



where A^s and fej's are known constants (see below), (3 = (fii, . . . , (3b) > 0, t = (ti, . . . , tb) such 
that tb > • • ■ > Ti > 0, and Paretos(/3, r) represents a B-piece Pareto distribution with survival 
distribution 

1, X < T\ 



S B (x) 



n^(^) ft }(?) 



Tl < X < T2 

r 2 < x < r 3 
x > r B 

and thus its distribution function Fb(-) = 1 — Sb(-)- Note that the i?-piece Pareto distribution 
corresponds to the broken power-law. The probability density fs can be easily found by differen- 
tiation. When B = 1, the S-Pareto distribution reduces to a Pareto distribution with probability 
density function 

( o. a 

X > T. 



J3t± 



;/3 + l ! 

0, X < T. 



In the above A^s, sometimes known as effective areas, represent sensitivities of the detector, while 
fei's represent background intensities. With the above model the goal is then to estimate B and, 
at the same time, (3 and r. At first sight, this seems to be a straightforward statistical problem: 
for a fixed B the maximum likelihood idea can be adopted to estimate (3 and r, while the issue 
of choosing B can be viewed as a model selection problem and thus traditional ideas such as AIC 
and BIC can be used. However, as to be seen below, practical implementation of these ideas poses 
serious computational challenges that cannot be easily solved. 



3 Maximum Likelihood Estimation When B Is Known 

In this section we provide details of how to obtain maximum likelihood estimates of (3 and r for a 
fixed number of breakpoints B in the log N — log S model. Defining fi$ = 0, tq = T\ and tb+i = oo, 
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the likelihood is 



L(J3,T;Y lt ...,Y n 



YA 



Note that the likelihood involves some numerically unstable integrals that do not have a closed 
form solution, and hence a direct maximization is extremely difficult. To further appreciate this 
difficulty, consider the case when there is no background contamination (pi = 0), for which the 
above likelihood degenerates to 



n 

i=l 



B 

£ 



YA 



{T(Yi - faAiTj) - T(Yi - Pj^AiTj+x)} 



Here, T(a,x) = t a ~ 1 e~ t dt is the gamma function which is numerically unstable, particularly 
when the first argument is large. Together with the inner summation in the above expression, these 
issues make a direct maximization of the (log-)likelihood difficult even when there is no background 
contamination. To address these issues we propose an EM-algorithm ( |Dempster et al. 1977) to 
find the maximum likelihood estimators of (3 and t for the general case of b% > 0. 



3.1 EM with Sufficiency Augmentation Scheme 



The EM algorithm (Dempster et a/. , 1977) has long been popular for its monotone convergence 



and resulting stability, and is therefore well-suited to our context. As always, the EM algorithm 
must be formulated in terms of "missing data" or auxiliary variables, that must be integrated out 
to obtain the observed data log-likelihood. For the current problem, since we are interested only 
in inference for j3 and r, marginalizing over the uncertainty in the individual fluxes, it is natural 
to treat S = (Si, . . . , S n ) T as the missing data. Since S is a sufficient statistic for = (/3, r) T , we 
call this the sufficient augmentation (SA) scheme in the terminology of |Yu and Meng ( 2011| ). 
Let Y = (Yi, . . . , Y n ) T . The complete data log-likelihood of (Y, S) is 



logp(Y, S; (3, t) = ^gg(Y l ; + h) + ^ log f B (S l ;f3, r), 



i=i 



i=i 



where g(x\ fi) is the probability mass function of a Poisson distribution with mean fi. In the E-Step 
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of the algorithm we compute the conditional expectation 



Q(6>|6»( fe ))=E{logp(Y,S;6>)|Y;0«} 

n n 

= E {log gtXi ; AiSi + bi) \Yi ; 0« } + ]T E {log / B (5, ; 0) |y i5 ^ } , (4) 

1=1 i=l 

where 0( fc ) denotes the estimate of at the fc-th iteration. The M-step of the algorithm must then 
maximize Q{0\G^) with respect to 6. Since the first term of Q does not depend on 0, it can be 
ignored in our maximization. For the second term, as it does not admit a closed form expression, 
a Monte Carlo method is used to approximate it. The basic idea is to estimate it by the mean of 
a suitable Monte Carlo sample of the Si's as described in Algorithm [TJ 

Without the first term in Q, the maximization of Q(6\6^) is equivalent to finding the MLE 
of = ((3,t) t from an iid sample X = (Xi, . . . ,X m ) from the Paretos(/3, r) distribution. The 
log-likelihood of X is 

B B B m 

1(6; X) = J2 Pi ( n j lo S T j - n i+i lo § T i+i) + m j lo S Pj ~ ^ Pj ^ log Xi - ^ log Xi, 

j=l j=l j=l i&Aj i=l 

where rij = card{i : Xi > Tj}, ub+i = 0, rrij = nj+i — rij, tb+i = oo, n^+i logrs+i is defined to 
be 0, and Aj = {i: Tj < Xi < t j+ i}. Note that the n^s and m^s are functions of r. For any fixed 
r, straightforward algebra shows that 1(6; X) is maximized when f3j is set to 



Pj( T ) = m j( T ) I 2^ log^i + n j+l (r) logr j+ i - nj(r) logTj ) , j = 1, . . . , B. (5) 

\ieAj 

By substituting the above expression, 1(6; X) becomes 

-m B 

l(6;X) = -m-Y J ^gX l + Y J ™j (t) log & (r). (6) 

i=i i=i 



Therefore, to obtain the MLE for 6 = ({3,t) t from X, one can first maximize 1(6; X) in ^ with 
respect to r, and then plug in the corresponding maximizer f (i.e., the MLE of r) into Q to 
obtain the MLE for (3. 

The MLE of t\ is fi = min(Xi, . . . , X m ), while unfortunately the MLEs for T2,...,tb do 
not admit closed-form expressions. Further, Q is not a continuous function in r and therefore 
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traditional optimization methods that require function derivatives (e.g., Newton-like methods) 
cannot be applied here. We have experimented with various optimization algorithms and found 
that the Nelder-Mead algorithm works well for this problem. The major steps of the EM algorithm 
in the SA scheme (SAEM) for finding the MLEs of are given in Algorithm [T] In practice, the 



SAEM algorithm often converges very slowly. Section 3.4 below provides some illustrative numerical 
examples. 

Algorithm 1 SAEM: EM with the Sufficient Augmentation Scheme (SAEM) 

1. Choose a starting value 6^°' and set k = 0. 

2. Generate S^ 1 S^ NahX1 ' from p(S|Y;#( fc )) using the following Metropolis-Hastings algo- 
rithm. For each simulation of S, we sample the elements of S one at a time. Suppose 
S = (Si, ... ,S n ) is the current draw. Denote S* = (Si, . . . , Sj-i, S*, Sj + i, ... ,S n ), where S* 
is drawn from Paretos(/3^ fc \ t^). We accept this S* as new value with probability dj(S, S*); 
otherwise, we retain S. The acceptance probability is given by 

//^:,1 ; S: • /,/) 



3. Find the maximizer 6 of the Monte Carlo estimate of Q(0\6^). This is equivalent to com- 
puting 

= argmax — V V log f B (sf } ; 6) , 

s=N buln +l 1=1 

where -/Vb urn is the number of burn-in. As discussed above, 6 can be obtained by the following 
steps: 

(a) set fi = min{sf ^ : i = 1, . . . , n, s = N hurn + 1, . . . , iV sim }, 

(b) obtain T2,...,tb as the maximizer of J2f=i m j( T *) Pj{ T *)i where r* = 
(fi, T2, . . . , Tb) ) using the Nelder-Mead algorithm, and 



(c) set /3j = (5j{f) using (J5J), for j = 1, . . . , i?. 

4. Set 0( fc+1 ) = 0. 

5. Repeat Steps 2 to 4 until convergence. 



3.2 EM with Ancillary Augmentation Scheme (AAEM) 

Given the slow convergence of the SAEM algorithm, we seek faster alternatives. This subsection 
proposes an alternative EM algorithm that is based on an ancillary augmentation (AA) scheme, 
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called the AAEM algorithm. For a discussion of augmentation schemes and their use in EM, 
Baines et al. ( |2012 ) . The basis of our AAEM is to re-express our model using auxiliary variables 



sec 



U i = F B (S i ;0): 



Yi\Uu...,U n m ~ P ' Poisson(A i F~ 1 (C/ i ; 0) + h) and 



Ui ~ Uniform(0,l), 

for i = 1, . . . , n. Here U = (f7i, . . . , f/ n ) is treated as the missing data, and preserves the observed 
data log-likelihood. In the E-Step we then calculate the conditional expectation 

n 

Q(0|0«) = {loggiYi; A t F B \Ui- 6) + bi)\Y; 0«} . (7) 



i=i 



This conditional expectation can be approximated and maximized in a similar manner as for the 
Q(6\9^) in the SAEM algorithm. The resulting AAEM algorithm is summarized in Algorithm [2j 
Section |3.4| provides some empirical comparisons between the AAEM and SAEM algorithms. As 
may be expected, there are some situations where the AAEM algorithm converges faster, while 
there are other situations where the SAEM algorithm converges faster. 

3.3 Interwoven EM (IEM) 

In practice, choosing the most efficient algorithm between the SAEM and AAEM requires knowledge 
of the unknown parameter values and the theoretical convergence rates, both of which are not 
available in most contexts. Therefore, it would instead be desirable if one could combine the "best 
parts" of SAEM and AAEM rather than select one of them. One simple way to combine the 
two algorithms is to use the so-called alternating EM (AEM) algorithm. The AEM algorithm 
proceeds by using SAEM for the first iteration, then uses AAEM for the second iteration, followed 
by SAEM for the third, and so on. While this procedure tends to "average" the performance 
of the two algorithms, a more sophisticated way to combine them is to use the Interwoven EM 



(IEM) algorithm of Baines et al. (2012). Theoretical and empirical results show that IEM typically 
achieves sizeable performance gains over the component EM algorithms. The key to the boosted 
performance of IEM is that it utilizes the joint structure of the two augmentation schemes through a 
special "IE-Step". In contrast, AEM simply performs sequential updates using each augmentation 



scheme that make no use of this joint information. The theory of the IEM algorithm in Baines et al 
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Algorithm 2 AAEM: EM with Ancillary Augmentation Scheme 



1. Choose a starting value 6^ and set k = 0. 

2. Generate U (1) , . . . ,U (iVsim) from p(U\Y;0^) using the Metropolis-Hastings algorithm. For 
each simulation of U, we sample the element of U one by one. Let U = (Ui, . . . , U n ) be the 
previous draw. If we denote U* = (U\, . . . , Uj-i, U*, {7j+i, . . . , U n ), where U* is drawn from 
Uniform(0, 1). We accept this U* as new value with probability bj(U, U*); otherwise, we 
retain U. The acceptance probability is given by 

6j(U, U ) = mm < 1, v 



giY^F^iUj-ei^ + bj) ]' 
3. Find the maximizer 6 of the following Monte Carlo estimate of Q(6\Q^)\ 



s=N huln +l 8=1 



The maximization can be done for example with the Nelder-Mead algorithm. 

4. Set 0( fe+1 ) = 6. 

5. Repeat Steps 2 to 4 until convergence. 



( 2012 ) shows that the rate of convergence of IEM is dependent on the "correlation" between the two 
component augmentation schemes. Since the SA and AA schemes typically have low correlation, 
here we interweave these two schemes to produce an IEM algorithm for estimating the parameters 
of flux distributions. 

The IEM algorithm for our log iV — log S model is given in Algorithm [3j The algorithm re- 
quires very minimal computation in addition to the component SAEM and AAEM algorithms so is 
comparable in real-time per-iteration speed. Lastly we note that there is some freedom in how to 
combine the IEM algorithm with MC methods. Specifically, there are variations in how one may 
choose to implement Step 3. One may want to sample U again instead of using the previous samples 
in Step 2. In both cases, one obtains a sample from U^, #( fc + - 5 ) and achieves the goal. From our 
practical experience, we found that there is very little difference between the performances of these 
two approaches. Thus, we choose to use the one which is least computationally expensive. 
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Algorithm 3 IEM: Interwoven EM 



1. Choose a starting value 6^ and set k = 0. 

2. Execute Steps 2 and 3 of the SAEM algorithm. Set 0( fc+0 - 5 ) = 0. 

3. Execute Step 3 of the AAEM algorithm, with U w generated as: uf> = F B {sf; 0( fc+0 - 5 )), for 
j = 1, . . . , n and I = N hnrQ + 1, . . . , iV sim . Set 6>( fc+1 ) = 6. 

4. If convergence is achieved or k attains iVjijnit, then declare to be MLE; otherwise set 
k = k + 1 and return to Step 2. 



3.4 An Empirical Comparison Amongst Different EM Algorithms 

In this subsection we empirically compare the convergence speeds of SAEM, AAEM, AEM and 
IEM by applying them to two simulated data sets. These two data sets were simulated from a 
model with B = 1 and no background contamination counts. This model is somewhat simple but 
the advantage is that the likelihood function simplifies considerably, and the corresponding maxi- 
mum likelihood estimates can be reliably obtained with non-EM methods. With these maximum 
likelihood estimates the maximized log-likelihood value can be calculated and used for baseline 
comparisons. 

In Figure [TJja) , for the first simulated data set, we plot the negative log- likelihood values of the 
SAEM, AAEM, AEM and IEM estimates evaluated at different iterations. One can see the slow 
convergence speeds of SAEM and AAEM, with SAEM being the slower. Also, both AEM and IEM 
converged relatively fast, with IEM being the faster. When comparing to AEM, IEM utilizes the 
relationship between SAEM and AAEM at each step, which leads to the superiority of IEM over 
AEM. 

We repeat the same plot in Figure [T]^b) for the second simulated data set. This time the relative 
speeds of SAEM and AAEM switched; i.e., SAEM converged faster. This illustrates the fact that 
none of SAEM and AAEM is superior to the other. The relative position of AEM and IEM remain 
the same. 

Overall from these two plots one can see that the IEM algorithm is the most efficient and 
robust. Also, when comparing to AEM, it is computationally faster due to the skipping of an 
extra sampling step. Similar performance was observed across a wide range of simulation settings. 
Therefore we recommend using the IEM algorithm to compute the maximum likelihood estimates 
when B is known. 
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(a) Simulated Data Set 1 



(b) Simulated Data Set 2 



Figure 1: Plots of negative log-likelihood values for different EM algorithms. In each plot the 
horizontal dashed line indicates the negative log-likelihood evaluated at the maximum likelihood 
estimates. 



4 Automated Choice of B 

This section addresses the important problem of selecting the number of "pieces", B, in the broken- 
Pareto model. Since this problem can be seen as a model selection problem, one can adopt well 
studied methods such as AIC and BIC to solve it. To proceed we first note that when B = 1, the 
number of free parameters in the model is 2B. With AIC, the best B is chosen as 

B A ic = argmax AIC(5) = argmax {-21og£(/3, f; Y x , . . . , Y n ) + 4b) , 
B B *- > 

while for BIC B is chosen as the minimizer of 



-Bbic = argmax BIC(-B) = argmax \ — 21ogL(/3, r; Y\, . . . , Y n ) + 2B logn > . 

B B ^ J 

Despite the straightforward definitions, in practice the numerical instability of the likelihood func- 
tion makes computation of AIC(-B) and BIC(-B) very challenging. To address this problem we 



adopt the so-called power posterior method proposed by Friel and Pettitt (2008) to approximate 
the log-likelihood directly. 
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In our context, the power posterior is denned as 



Pt(S|Y; 6) oc p(Y|S)*p(S; 0) for < t < 1. 



In addition, define 



z 



(Y\t) = [ p(Y\s) t p(s;0)ds 



and, for simplicity, write the likelihood as p(Y) = L(f3, t; Y\, . . . ,Y n ). The following equality is 
crucial to this method: 



where the last expectation (inside the integral) is taken with respect to the power posterior 
pt(S[Y;0). The idea is as follows. First, for any given t, Monte Carlo methods can be applied to 
sample from the power posterior and approximate the expectation. Once a sufficient number of 
these expectations (corresponding to different values of t) are calculated, numerical methods can be 
used to approximate the integral, which is the same as the log-likelihood. Since this method approx- 
imates the log- likelihood directly (i.e., without the computation of the likelihood), it is numerically 
quite stable. The detailed algorithm is presented as Algorithm |4j 

The above algorithm provides a reliable method for approximating the log-likelihood for a 
given value of 6. Then one natural question to ask is, can we not simply obtain the MLE of 
by directly maximizing this log-likelihood approximation via, say, Newton's method? The answer, 
in principle, is yes, but the IEM algorithm is still preferred mainly because the estimates from 
IEM are generally more stable and reliable. Moreover, the power posterior approximation to the 
log-likelihood is computationally intensive if one wants to obtain an accurate estimate. For these 
reasons, we only use this power posterior approximation to estimate the log-likelihood evaluated 
at the MLE obtained by the IEM algorithm. 

5 Simulation Experiments 

Numerical experiments were conducted to evaluate the practical performance of the proposed 
methodology. Four experimental settings were considered: 

1. B = 1, r = 5 X 1(T 17 , (3 = 1 and n = 100, 
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Algorithm 4 Power Posterior Method for Log-Likelihood Calculation 



1. Choose a starting value and set k = 0. 

2. Set t = (k/N gr id) c , where c controls the density of the grid values of t. It is typically set to 3 



or 5 (see Friel and Pettitt 2008). 



3. Generate S (1) , . . . , S (7Vsim) from pt(S|Y; 0) using the Metropolis-Hastings algorithm described 
in Step 2 of the SAEM algorithm. Note that the acceptance probability becomes 

,«, ■ /, \ 9(Y fi A j S' } +b i ) \'\ 

MS,s ) = ml „| 1 ,| 9(1 , ;AA + M )|. 

4. Estimate E[log{p(Y|S)}|Y;0,i] with 

-i -^V sim 

! * = N . - Nh E 1°SpC*1SW;0). 

1 "sim 1 v burn Ar . , 

5. If k < N grid , set k = k+1, S (0) = Y^7=N huin +i S^V^sim-Aburn), and go to Step 2. Otherwise 
go to the next step. 

6. Given the l^s, the log-likelihood log{p(Y)} can be approximated via any reliable numerical 
integration method. 



2. B = 2,t = (1x 1(T 17 , 5 x 1(T 17 ) T , /3 = (0.5, 3) T and n = 200, 

3. B = 2, r = (1 X lO" 17 , 5 x 10- 17 ) T , /3 = (0.5, 1.5) T and n = 200, 

4. 5 = 3, r = (1 x 10" 17 , 8 x 10~ 17 , 1.8 x 10~ 16 ) T , (3 = (0.3, 1, 3) T and n = 500. 

The parameter values of these settings were chosen to mimic the typical behavior of the real data. 
The effective areas and the expected background counts are set to A\ = 10 19 and bt = 10 respectively 
for all i. 

Two hundred data sets were generated for each experimental setting. For each generated data 
set, both AIC and BIC were applied to choose the value of B, and model parameters were estimated 
by the IEM algorithm. The selected values of B are summarized in Table [TJ One can see that BIC 
works substantially better than AIC for selecting B, and while BIC occasionally overestimates B, 
there is a clear tendency for AIC to consistently overestimate B. 

Other crucial factors that determine the ability of our method to detect structural breaks in 
the population distribution include: (i) the sample size, (ii) the separation between breakpoints, 
and, (iii) the magnitude of the difference between the power-law slopes on adjacent segments. The 
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impact of the third factor can be seen by comparing simulation results from settings [2] and [3j 
where the misclassification rate is seen to increase as the slopes become closer. From additional 
simulations our experience suggests that in typical settings a sample size of 200 or more is needed 
to reliably detect a single breakpoint, with double this required to detect two breakpoints. In 
simulations, true breakpoints can be detected for smaller sample sizes, but at a lower rate that is 
more dependent on the noise properties of the specific simulation. 

In addition to selecting the number of breakpoints, we also conducted a simulation to assess the 
quality of parameter estimation when using the IEM algorithm. For each experimental setting, we 
calculated the squared error (f3\ — $i) 2 of (5\ for all those data sets where B were correctly selected. 
We then computed the average of all these squared errors, denoted as mse(/3i), and calculated the 
relative mean squared error \l mse(/3i) / f3\. Similar relative mean squared errors for other estimates 
in (3 and t were obtained in a similar manner. These relative mean squared errors are given in 
Tabled We note that all of these are of the order of 10 -2 or 10 _1 . 
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Model Selection 




B 






Setting 


Method 


1 


2 


3 


4 


1 


AIC 


94 


53 


35 


18 




BIC 


164 


33 


3 





2 


AIC 





135 


45 


20 




BIC 





198 


2 





3 


AIC 





110 


71 


19 




BIC 





177 


23 





4 


AIC 








138 


62 




BIC 








194 


6 



Table 1: The number of pieces B selected by AIC and BIC. 



6 Application: Chandra Deep Field North X-Ray Data 

We now apply our method to data from the Chandra Deep Field North (CDFN) X-ray survey. 
Our dataset comprises a total of 225 sources with an off-axis angle of 8 arcmins or less and counts 
ranging from 5 to 8655. The full CDFN dataset is comprised of multiple observations at many 
different aimpoints, however we here consider only a subset where the aimpoints are close to each 
other to avoid additional modeling considerations. Since off-axis angle measures the radial distance 
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AIC 
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2.55 x 10" 
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9.81 x 10" 


•2 


1.13 x 10" 
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BIC 


3.52 x 10" 


-2 


2.60 x 10" 


-2 




9.17 x 10" 


2 


1.08 x 10" 


i 




3 


AIC 


3.52 x 10" 


-2 


1.42 x 10" 


1 




1.20 x 10" 


1 


1.32 x 10" 


i 






BIC 


3.57 x 10" 


-2 


1.29 x 10" 


1 




1.11 x 10" 


1 


1.35 x 10" 


i 




4 


AIC 


2.71 x 10" 


-2 


3.26 x 10" 


■2 


5.04 x 10"* 


7.08 x 10" 


•2 


9.91 x 10" 


2 


1.23 x 10 _I 




BIC 


2.72 x 10" 


-2 


3.94 x 10" 


-2 


4.97 x 10" 2 


7.16 x 10" 


2 


9.74 x 10" 


2 


1.19 x 10 _1 



Table 2: The relative mean squared errors of (3 and r, conditional on selection of the correct B. 

of the source from the center of the detector, sources with large off-axis angles can be thought 
of as being "close to the edge of the image" . Sources appearing at large off-axis angles appear 
much larger and at lower resolution than those closer to the center of the detector. The source- 
specific scaling constant, effective area Ai, is used to account for variations in the expected number 
of photons as a function of source location and photon energy. However, at large off-axis angles 
additional complications such as "confusion" (two or more sources overlapping and appearing as 
one) and "incompleteness" (possible non-detections of fainter sources) must be considered. For the 
purposes of our analysis here, we include all sources with an off-axis angle < 8 arcmin to achieve 
a worst-case completeness of 80%. We also consider thresholding at < 6 and < 7 arcmins, with a 



full discussion of the sensitivity to this threshold considered in Section 6.1 



-16.0 -15.5 -15.0 -14.5 -14.0 

Iog10(s) 



Figure 2: log iV — log S plot for the Chandra Deep Field North data with off- axis angle truncation 
at 8 arcmins. The vertical dotted lines are drawn at f\ and T2- The red lines correspond to the 
fitted broken-Pareto model with estimated slopes /3i and fa- 
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Applying our model selection procedure to the dataset with < 8 arcmins yields an estimate of 
B = 2, with B = 1 for the < 6 and < 7 arcmin subsets. As discussed in detail in Section 



6.1 



the consistency of the observations in the 6 — 8 arcmin range suggests that the ability to detect 
the presence of a breakpoint is limited by the small sample sizes at < 6 and < 7. Figure [2] shows 
the log A — log S plot for the < 8 arcmin dataset, depicting the log (base 10) of the empirical 
survival count as a function of the log flux, using the imputed fluxes from the final E-step of our 
algorithm. While the plot ignores the uncertainty in the S^s, it remains the standard plot for the 
analysis of log A — log S relationships. We note from the plot that the "break" is clearly visible 
around log 10 (ri) = —15.657, with a change in slope from 0.48 to 0.85. Full parameter estimates 
and standard error estimates are provided in Table [3| Standard error estimates are obtained using 
a simple Bootstrap resampling procedure. Our analysis shows that a two-piece broken power-law 



model is preferred for this subset, with a breakpoint at a lower flux than shown in Moretti et al. 



(2003), and with the lower segment at a flatter slope. This differs from what would be expected if 



point sources are to make up all of the diffuse background (Hickox and Markevitch 2007), suggesting 
that a significant proportion of the residual X-ray background is composed of diffuse emission; see 



also Mateos et al. (2008). 



Parameter 


Estimate 


SE 


Pi 


0.483 


0.060 


fa 


0.854 


0.224 


lo gio( T i) 


-16.344 


0.030 


lo glo( T 2) 


-15.657 


0.271 



Table 3: Parameter estimates and standard errors for the Chandra Deep Field North (CDFN) 
dataset. 



6.1 CDFN Source Selection 

In this section we consider the sensitivity of our analysis to the chosen off-axis angle threshold. 
As discussed in Section [6j at higher off-axis angles there are additional complications such as 
incompleteness and confusion that must be built into any statistical analysis that are not covered 
by the method presented here. Let K denote the maximum off-axis angle; i.e., all sources with 
off-axis angle less than K are retained, all others are excluded from the analysis. The choice of 
K = 8 for our analysis in Section [6] is motivated by scientific considerations and an estimated 
completeness above 80% at K = 8. However, by varying the truncation point we obtain additional 
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log 


io(*) 


P 




K 


n 


lo gio(n) 


logi (r 2 ) 


Pi 


Pi 


4 


77 


-16.364 




0.788 




5 


112 


-16.353 




0.738 




6 


152 


-16.329 




0.691 




7 


192 


-16.373 




0.590 




8 


225 


-16.343 


-15.668 


0.482 


0.850 


9 


257 


-16.352 


-15.732 


0.449 


0.850 


10 


287 


-16.378 


-15.696 


0.450 


0.792 


11 


298 


-16.389 


-15.702 


0.456 


0.793 


12 


303 


-16.403 


-15.677 


0.454 


0.802 


13 


304 


-16.429 


-15.843 


0.412 


0.743 



Table 4: CDFN Results by varying off-axis truncation 

insight into the sensitivity of our analysis to this decision, as well as to the statistical senstitivity 
to the sample size required for breakpoint detection. Table [4] shows the results of the analysis for 
differing values of K. As explained, results for K > 9 are likely to be untrustworthy, although they 
happen to be similar to those with K = 8. On the other extreme, if we truncate at K = 4 or K = 5 
we unnecessarily discard a large number of sources. 

We note that at K = 7 we are also no longer able to formally detect a break i.e., B = 1. 
However, upon closer examination the BIC values for B = 1 and B = 2 when K = 7 are very 
similar (2186.79 vs. 2188.37), indicating that there is little to choose between the B = 1 and B = 2 
models. With a few additional data points added at K = 8, our procedure then has enough power 
to detect the break at K = 8. It is worth noting that all additional data points with off-axis angle 
between 7 and 8 were manually screened, and are quantitatively very similar to those with K < 7. 
That is, the detection (or lack) of a breakpoint in this context appears to be primarily determined 
by the sample size of the dataset used. This is consistent with our results from the simulation study 
in Section [5j where a sample size of approximately 200 was required to reliably detect a break with 
similar parameter configurations. Indeed, looking at the plot in Figure [2j we note that the break 
is rather a subtle one, with the estimated slopes differing by approximately 0.37. In summary, for 
this particular dataset we note that there appears to be evidence of a breakpoint, although the 
sample size required to detect the breakpoint is not reached until we truncate at K = 8, just before 
additional modeling considerations such as incompleteness must be accounted for. 
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7 Theoretical Properties 



This section deals with the large-sample properties of the proposed procedure. In order to make 
the proofs more accessible, we first focus on the case when B is known, with no background 
contamination (6, = for all i) and all Ai are assumed to be identical. The more general cases will 
be briefly discussed later. 

If it is assumed that Ai = A > for all i = 1, . . . , n, then Yi, . . . , Y n constitute an iid sample 
from the previously stated statistical model. Denote the density of Y\ by 



f(y;0) 



oo e -ju 



(As) 



B 

3=1 V 3 



yi 



-f B (s;/3,r)ds 



T.i-i 



{T{y - faArj) - T(y - fr, Ar j+1 )} 



B 

E 

3=1 



Ar j+1 



t y-P 3 -i e -t dL 



The parameter space is defined as = {0 = (/3, t) t G IR^ 6 : (3j ^ /3j+i,Tj < Tj + i,j = 1, . . . , B — 1}. 
Let 6q = (/3q,to) t G denote the true parameter value. Notice that is not compact and that 
the value of the likelihood does not converge to zero if the parameter approaches the boundary of 



0. Therefore standard arguments such as the ones based on Wald (1949) do not apply directly 



in order to establish strong consistency of the maximum likelihood estimator 9 of 6q. Instead a 



compactification device is applied to subsequently use the results of Kiefer and Wolfowitz (1956). 
This leads to the following result. 

Theorem 1. Suppose B is known and Ai = A > for all i = l,...,n. Then, the maximum 
likelihood estimator is strongly consistent for 6q, that is, 6 — > Oq with probability one as n — > oo. 

The proof of Theorem [T] is provided in Appendix [Aj To weaken the restriction of identical Ai, 
observe that this condition is mainly applied to allow the use of the strong law of large numbers 



Wald 


(1949 


) and 


Kiefer and 



Wolfowitz ( 1956 ) . Since the arguments used to prove Theorem[T]are still valid if only the assumption 
Ai > is made, Kolmogorov's version of the strong law of large numbers can be applied to adapt 
their proof to the present case, imposing additional assumptions such as the Kolmogorov criterion 



E 



Var^) 



< oo 
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or conditions ensuring the validity of Kolmogorov's three-series theorem. Then, the result of Theo- 
rem[T]holds also in this more general setting. The case for non-zero 6j's can also dealt with similarly, 
but with long and tedious algebra. 

In the theory developed above, the number of pieces, B, in the broken-Pareto model is assumed 
to be known. The case of unknown B is, however, substantially more difficult. In fact, results in 
simpler settings such as the traditional "change in mean" scenario, in which segments of independent 
observations differ only by their levels, strong distributional assumptions become necessary to 
show consistency of an estimator for B. These typically require normality of the observations 
so that sharp tail estimates of the supremum of certain Gaussian processes are available; e.g., 
see Yao (1988). These techniques have also been exploited in Aue and Lee (2011) for image 



segmentation purposes. However, in the current context of the more complex broken-Pareto model, 
these arguments are not applicable and in fact it seems infeasible to derive theoretical properties 
under a set of practically relevant assumptions. 



8 Concluding Remarks 

We provide a coherent statistical procedure for selecting the number and orientation of "pieces" in 
an assumed piecewise linear log N— log S relationship. Our framework allows astrophysicists to use a 
principled approach to reliably select the model order B, and for parameter estimation via maximum 
likelihood estimation in a numerically challenging context. To the best of our knowledge, this is 
the first statistically rigorous procedure developed for solving this important scientific problem. 
-R-codes implementing the proposed procedure can be obtained from the authors. 



A Technical Details 



To prove Theorem [TJ the five assumptions made in Section 2 of Kiefer and Wolfowitz ( 1956 ) need 
to be verified. This is done in the following. 

Assumption 1. It is required that f(y; 0) is a density with respect to a a-finite measure /i on a 
Euclidean space of which y is the generic point. 

Proof. This condition is satisfied since the underlying distribution is discrete. □ 
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Define a metric on the space G by setting 

B B 
S(0i, 62) = I arctan/?ij — arctan/32j| + | arctanrij — arctanr2 j j 



Following Kiefer and Wolfowitz (1956), the parameter space is compactified by defining to be the 
completion of by adding all the limits of its Cauchy sequences in the sense of the above metric. 
Unless otherwise mentioned, all limits involving 9 are understood to be with respect to 5. The 



Euclidean norm is denoted by | • \ e- To verify the next assumption of Kiefer and Wolfowitz (1956), 
two auxiliary lemmas are introduced. 

Lemma 1. For sufficiently large (3j and a fixed y £ No, we have 

PjiATj)* ( ATl+1 p-to-^dt < 2(AT j ) y e- Ar J, 

JATj 



where r G 6. 

rA 



Proof. Note that ^(Atj)^ J^ j+1 P'^^e^dt < Pj(ATj)^T(y - Pj,A Tj ). Thus, by Theorem 2.2 



of Borwein and Chan (2009), for a sufficiently large j3j and a fixed y € No, 



r(y-^A Tj )< „j a 

y Pi 

and consequently pj(Arj)^T(y - faATj) < 2(Arj) y eT AT i . □ 
Lemma 2. If\r\ E < 00, Yvca^^^ f(y;9) exists. 

Proof. Note that if \P\e 00, there exists a j such that (3j — > 00. We focus on that one particular 
j. Let gj(y; 9) = ^(Ar^i f^ j+1 t y ~^ :i ~ 1 e~ t dt. In order to show that the limit of / exists, we only 
have to show that the limit of g exists (since it generalizes to any j with f3j — > 00). Note that, 
instead of considering gj, we look at hj(y;9) = loggj(y;9). We define hj j i(y;9) = log{(3 j(Arj) 13 ^ } 
and h j>2 (y; 9) = log f£j +1 t^-^dt. Then we have 

dhj{y-9) = dh jt i(y,e) dh ja (y;9) 

d/3j d/3j d(3j 

f^ +1 ty-^e-\io g t)dt 



^ + log(^.)} + 



d%,i(y,9) 



d$j P) 
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Note that d 2 h jt2 (y,9)/dl3] = Var(log(T)), where T is a random variable with density 



t y-ft-i e -t 



Thus d 2 hj^(y', 0)/d$j > and dhj^iy; 0)/d/3j is increasing with respect to j3j. It follows then that 
dhj^iy, 6)/df3j is bounded from above since hj is bounded from above by Lemmajl] Consequently, 
]hnp j -+ oo dhj ) 2{y;0)/dPj exists. □ 

Assumption 2 (Continuity Assumption). It is possible to extend the definition of f(y;9) so that 
the range of 9 will be and so that, for any {6i} and 9* in 0, 9i — >■ 9* implies 

f(y,0)^f(y,e*) 

except perhaps on a set of y whose probability is according to the probability density f(y;0o). 
(The exceptional y-set may depend on 9* and f(y; 9*) need not be a probability density function.) 

Proof. First, f(y; 9) is continuous with respect to 9 S and thus / automatically fulfills the above 
continuity requirement for 9 G 0. Define dQ = 0\0. Now, we will show that we can define 
f(y;9*), where 9* G dQ, as limg^g* f(y;9). It is thus only required to show the existence of 
this limit. Notice that limg^g* f(y;6) exists for boundary points 9 G dQ with \Q\e 7^ oo. The 
remaining case \9\e = oo can be separated into three sub-cases: (i) \@\e = oo and \t\e < oo, (ii) 
\P\e < oo and \t\e = oo, and (iii) \/3\e = oo and \t\e = oo. 

1. Suppose \/3\e = oo and \t\e < oo. From Lemma[2j lim^^^^ f(y; 9) exists. 

2. Suppose | (3\e < oo and \t\e = oo. This implies that there exists at least one j such that 
Tj = oo. Here, we have 

J Aa J Aa 

where < a < b. Taking the limit on the right-hand side, using the l'Hospital rule, it follows 
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that 

ff ■tv-to- 1 e-*dt A y-P -i T v e -Aa 
lim - n = lim J 



a "i a-s>oo Pj 

Since < (c/a)^- 1 < 1 for all < c < a, lim^^^ f(y; 6) exists. 
3. \(3\e = oo and \t\e = oo. The existence of limtgt^^ f(y;6) is basically implied by Lemma 

m 

The proof is complete. □ 
Assumption 3. For any in and any p > 0, 0, p) is a measurable function of y, where 

w(y,6,p) = sup/(y;0')> 
the supremum being taken over all 0' in for which 5(0, 6') < p. 

Proof. The statement is implied by the continuity of f(y; 0) with respect to 6 E 0. □ 
Assumption 4 (Identifiability Assumption). If in is different from Oq, then, for at least one 



f{y\6)dp^ / f(y\6 )dp, 

J — oo 

t/ie integral being over those y all of whose components < the corresponding of x. 

Proof. In the present case, p is the counting measure and thus, for all 9 G 0, if f(y\0) ^ f(y\6o) 
for at least one y 6 No, it fulfills the above assumption. This is obviously true for 6 € 0. Since 
00 £ ©) it is also easy to see that the above is true for 6 6 0. □ 

Assumption 5 (Integrability Assumption). For any in we have 

w(Y;0, P y 



limE 

p|0 



log 



/C;0o) 



< oo, 



where w is defined in Assumption^ 



Proof. Since f(y; 0) is continuous and bounded over 0, logu;(y; 6, p) is bounded from above. Now, 
we want to show that E| log f(Y; 6q)\ < oo. Since f(y;9o) is bounded from above, we only need 
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E{log(/(l r ; Oq)} > — oo, which can be shown as follows. Note that, for any 9 G G, 



E[log{/(r;0)}] =E 



B 
B 

= £>» 



log 



T 3-l 



T 3-l 



B 



J j=l 



log 



Y\ 

T(Y — Atj) 

r(y + 1) 



Here, 



nr-^Arj) _ r(y-/3„Ar j )r(y-/3 J ) _ r(y-&; 

" r(y -/?,-) r(y + i) Pj ' ij r(y + i) 



r(y + i) M . 

where Q is the regularized incomplete gamma function. Now, we state the asymptotic expansions of 
the regularized incomplete gamma function and the ratio of two gamma functions: When a — > oo, 



_ , A a- a - 1 ' 2 e a - z z a ( /l 

Q(a, z)<xl = <l + 0[- 

V27T I \a 



T(a + b) h _A (1 

—7 r oca < 1 + O - 

r(a + c) \ V a 



Applying these asymptotic expansions for large y, 



r(y + i) 



oc y 



1 l + O 



Thus, in order to bound E[log{r(y - Pj,ATj)/T(Y + 1)}], we only have to bound E{log(y)l {y > M} } 
away from oo for sufficiently large M. Here, we define logO x = 0. Now, we only have to consider 
the boundedness of Y^=m ^°^>{v) lu 1 f° r 3 = 1> • • • > B. It is bounded whenever /3j > 0, which is 
fulfilled by any 6 G 6. Thus, E{log(/(y ; 6 )} > -oo since 9q G 0. □ 



The statement of Theorem [T] follows now from Section 2 of Kiefer and Wolfowitz ( 1956 ) 
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